In a binary classification problem we have samples of data $x \in \mathbb{R}^n$, and we want to predict the value of a target variable $y \in \{0, 1\}$. For instance, a farmer might want to know whether a $32 \times 32$ image $X \in \mathbb{R}^{32\times 32}$ contains a picture of a cucumber or not. We model the absence or presence of a cucumber with outputs of $0$ or $1$, respectively.
The logistic regression approach to classification uses a hypothesis function $h_\theta$ of the form
$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}. $$The parameter $\theta \in \mathbb{R}^n$ is what we want to optimize. Since $h_\theta(x) \in (0, 1)$, we can interpret its value as the probability of $x$ having a certain label:
\begin{align*} \mathbb{P}(y = 1 ~|~ x, \theta) &= h_\theta(x) \\ \mathbb{P}(y = 0 ~|~ x, \theta) &= 1 - h_\theta(x). \end{align*}So if $h_\theta(x) \geq 0.5$, we predict $y =1$, otherwise we predict $y = 0$. Written differently, this is
$$ \mathbb{P}(y ~|~ x, \theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y}. $$Now, suppose we have $m$ independently generated samples in our dataset. As usual, we arrange these $m$ samples into an $m\times n$ matrix whose rows each represent individual samples. The likelihood of the parameter $\theta$ is given by
\begin{align*} L(\theta) &= \mathbb{P}(y ~|~ X, \theta) \\ &= \prod_{i=1}^m \mathbb{P}(y^{(i)} ~|~ x^{(i)}, \theta) \\ &= \prod_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}. \end{align*}Our goal is then to choose $\theta$ to maximize this likelihood. In practice, it is easier to maximize the log-likelihood function:
\begin{align*} l(\theta) &= \log(L(\theta)) \\ &= \sum_{i=1}^m y^{(i)} \log [ h_\theta(x^{(i)})] + (1 - y^{(i)}) \log[1 - h_\theta(x^{(i)})]. \end{align*}We can maximize the log-likelihood by performing stochastic gradient ascent. In other words, we choose a training pair $(x, y) = (x^{(i)}, y^{(i)})$ at random, and compute the gradient of $l$ at this pair using the formula:
\begin{align*} \frac{\partial }{\partial\theta_j} l (\theta) &= \left(y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)}\right) \frac{\partial}{\partial\theta_j}g(\theta^T x) \\ &= \left(y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)}\right) g(\theta^T x)(1 - g(\theta^Tx)) \frac{\partial}{\partial\theta_j}\theta^Tx \\ &= (y(1 - g(\theta^Tx)) - (1 - y)g(\theta^Tx))x_j \\ &= (y - h_\theta(x))x_j. \end{align*}Above we used the derivative identity $g'(z) = g(z)(1-g(z))$. To choose new $\theta$ values, we take a small step in the direction of the gradient (since we are maximizing $l(\theta)$). This gives the update rule
$$ \theta_j = \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)} $$where $\alpha$ is the learning rate parameter.
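As a concrete illustration, a minimal numpy sketch of the hypothesis function and this update rule might look as follows (the names h and sgd_logistic, the learning rate, and the number of epochs are illustrative choices, not prescribed by the text):
In [ ]:
import numpy as np

def h(theta, x):
    # logistic hypothesis: h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

def sgd_logistic(X, y, alpha=0.01, epochs=50, seed=0):
    # stochastic gradient ascent on the log-likelihood l(theta)
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            # update rule: theta_j <- theta_j + alpha * (y - h_theta(x)) * x_j
            theta += alpha * (y[i] - h(theta, X[i])) * X[i]
    return theta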
The sklearn breast cancer dataset consists of $569$ $30$-dimensional data points. The goal is to classify each data point as representing either a malignant or benign tumor. You can load the data with the following code:
In [1]:
from sklearn import datasets
bc = datasets.load_breast_cancer()
samples, targets = bc.data, bc.target
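Putting the sketch above together with this data, one possible way to train and evaluate a classifier is shown below; the feature standardization and the 80/20 train/test split are assumptions added for illustration (without scaling, the raw features would saturate the sigmoid), and sklearn.metrics.accuracy_score is used to score the predictions.
In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    samples, targets, test_size=0.2, random_state=0)

# standardize features so that theta^T x stays in a reasonable range
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

theta = sgd_logistic(X_train, y_train, alpha=0.01, epochs=50)
preds = (np.array([h(theta, x) for x in X_test]) >= 0.5).astype(int)
print(accuracy_score(y_test, preds))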
Exercise: Write a Python implementation of the logistic regression hypothesis function $(\theta, x) \mapsto h_\theta(x)$, and use it together with the stochastic gradient ascent update above to classify this dataset (the sklearn.metrics.accuracy_score function may come in handy here).
Principal component analysis (PCA) is a dimensionality reduction technique. The idea is to project the data down to a lower dimension by 'dropping' those directions/dimensions that don't contain much variance. For instance, consider the following sample of data points in 2D:
[Figure: a 2D scatter of points with a long arrow along the direction of greatest variance and a short orthogonal arrow.]
The goal of PCA in this case would be to project all of the data points onto the axis spanned by the longer arrow; equivalently, since the shorter arrow is orthogonal to the longer one, this means projecting along the direction of the shorter arrow. The new dataset will be 1-dimensional, and since most of the variation in the data was along the direction spanned by the long arrow, hopefully we haven't lost much information.
For more details about the mathematics of PCA, see Andrew Ng's great notes here.
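To make the 2D picture concrete, here is a small sketch using synthetic data (the correlated Gaussian cloud is purely illustrative) that projects onto the first principal component with sklearn:
In [ ]:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# a correlated 2D cloud: most of the variance lies along a single direction
X2d = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])

pca = PCA(n_components=1)
X1d = pca.fit_transform(X2d)          # project onto the 'long arrow'
print(pca.explained_variance_ratio_)  # fraction of the variance retained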
The sklearn digits dataset contains images of handwritten digits, much like the famous MNIST dataset. Here's a sample:
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

digits = datasets.load_digits()
samples, targets = digits.data, digits.target

# each sample is a flattened 8x8 grayscale image
plt.imshow(np.reshape(samples[0], (8, 8)), cmap='Greys')
The images are each $8 \times 8$ pixels, for a total of $64$ dimensions.
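You can confirm this by inspecting the shape of the data array:
In [ ]:
print(samples.shape)  # (1797, 64): 1797 flattened 8x8 images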
Exercise: Use sklearn's PCA implementation to reduce the dimensionality of the digits dataset (see sklearn.decomposition.PCA), then classify the reduced data with a k-nearest-neighbors classifier (see sklearn.neighbors.KNeighborsClassifier).
In [ ]:
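One possible approach to this exercise might look like the following sketch (the choice of 16 components and 5 neighbors is arbitrary, and the train/test split is an assumption added here):
In [ ]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    samples, targets, test_size=0.2, random_state=0)

# reduce from 64 dimensions down to 16 principal components
pca = PCA(n_components=16).fit(X_train)
X_train_r, X_test_r = pca.transform(X_train), pca.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_r, y_train)
print(accuracy_score(y_test, knn.predict(X_test_r)))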